[Performance] Improve MiMo-Audio tokenizer decoding performance by qibaoyuan · Pull Request #2183 · vllm-project/vllm-omni

qibaoyuan · 2026-03-25T11:50:09Z

Purpose

This PR speeds up audio tokenizer decoding in the MiMo-Audio model. The tokenizer decoder is invoked frequently in asynchronous scenarios, so its latency matters. The approach is to add fixed-shape `forward_fixed` code paths so that decoding can run under CUDA Graphs.

Key changes include:

Attention.forward_fixed — Replaces flash_attn_varlen_func with F.scaled_dot_product_attention, operating on 3D tensors [B, L, D], thereby avoiding variable-length packing.
TransformerLayer.forward_fixed — Combines self_attn.forward_fixed with the feed-forward network (FFN).
CausalConvTranspose1d.forward_fixed — Applies transposed convolution directly on 3D tensors without using masked_select.
TransformerVocos.forward_fixed — Implements a mask-free forward path for the vocoder.
AudioDecoder.forward_fixed — Constructs the full decoder pipeline: dconv1 → transformer layers → dconv2 → vocoder.
MiMoAudioTokenizer.decode_fixed — Wraps the complete decoding process, including decode_vq, padding, and decoder.forward_fixed.
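The idea behind the `forward_fixed` paths can be illustrated with a minimal sketch: dense attention on a fixed-shape `[B, L, D]` tensor through `F.scaled_dot_product_attention`, wrapped in the standard PyTorch CUDA Graph capture-and-replay pattern. The names `attention_fixed` and `make_graphed`, the tensor sizes, and the CPU fallback are assumptions made for illustration; this is not the code in this PR.

```python
import torch
import torch.nn.functional as F


def attention_fixed(q, k, v, num_heads):
    """Dense attention over fixed-shape [B, L, D] tensors (no varlen packing)."""
    B, L, D = q.shape
    head_dim = D // num_heads

    def split_heads(x):
        # [B, L, D] -> [B, H, L, head_dim], the layout SDPA expects
        return x.view(B, L, num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(
        split_heads(q), split_heads(k), split_heads(v)
    )
    # [B, H, L, head_dim] -> [B, L, D]
    return out.transpose(1, 2).reshape(B, L, D)


def make_graphed(fn, static_in):
    """Return a callable that replays `fn` via a CUDA Graph when CUDA is present."""
    if not torch.cuda.is_available():
        return fn  # eager fallback so the sketch still runs on CPU

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side), torch.no_grad():
        for _ in range(3):
            fn(static_in)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_out = fn(static_in)

    def run(x):
        static_in.copy_(x)  # same shape and same buffer on every call
        graph.replay()
        return static_out.clone()

    return run


device = "cuda" if torch.cuda.is_available() else "cpu"
static_in = torch.zeros(2, 8, 16, device=device)  # hypothetical [B, L, D]
decode = make_graphed(lambda t: attention_fixed(t, t, t, num_heads=4), static_in)

x = torch.randn(2, 8, 16, device=device)
y = decode(x)
```

A captured graph replays with the same shapes and buffers each time, which is why variable-length packing and `masked_select` (whose output size depends on the data) have to be replaced by fixed-shape operations plus padding.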

Test Plan

export MIMO_AUDIO_TOKENIZER_PATH="XiaomiMiMo/MiMo-Audio-Tokenizer"

python3 -u end2end.py \
--stage-configs-path ./vllm_omni/model_executor/stage_configs/mimo_audio.yaml  \
--model  "XiaomiMiMo/MiMo-Audio-7B-Instruct" \
--query-type tts_sft_with_audio \
--audio_path ./examples/offline_inference/mimo_audio/beijing.mp3 \
--text "我还知道东北有杀猪菜，是把猪血肠、五花肉、酸菜等放在一块炖的，味道很浓郁。"

Test Result

Request ID: 0_3581f0d8-1ec1-4063-a223-72fa6a95b4a1, Text saved to ./output_audio/tts_sft_with_audio/0_3581f0d8-1ec1-4063-a223-72fa6a95b4a1.txt

Request ID: 0_3581f0d8-1ec1-4063-a223-72fa6a95b4a1, Audio saved to ./output_audio/tts_sft_with_audio/0_3581f0d8-1ec1-4063-a223-72fa6a95b4a1.wav

Essential Elements of an Effective PR Description Checklist

[x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
[x] The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
[x] The test results. Please paste the results comparison before and after, or the e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
(Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

hsliuustc0106 · 2026-04-30T10:10:45Z

will this pr solve the acc issue?

hsliuustc0106 · 2026-05-06T12:36:58Z

can you provide the e2e improvement from this PR?

qibaoyuan · 2026-05-07T12:21:43Z

> can you provide the e2e improvement from this PR?

CUDA Graph vs Eager Execution Performance Comparison

Compared with eager execution, CUDA Graph cuts latency by roughly 7× to 11× in both non-streaming and streaming modes.

Performance Table

| Mode | Eager latency (ms) | CUDA Graph latency (ms) | Speedup |
| --- | --- | --- | --- |
| Non-streaming | 60 | 5.5 | 10.93× |
| Streaming | 219.3 | 31.1 | 7.05× |

Key Observations

Non-streaming mode: CUDA Graph reduces latency from 60 ms to 5.5 ms, a reported 10.93× speedup.
Streaming mode: latency decreases from 219.3 ms to 31.1 ms, a 7.05× speedup.
Overall, CUDA Graph cuts latency by about 91% in non-streaming mode and about 86% in streaming mode, which matters most for latency-sensitive inference workloads.
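As a quick check of the arithmetic, the speedup and latency reduction can be recomputed from the latencies in the table. This is a small sketch; the helper names are made up for illustration. The rounded non-streaming values give about 10.91×, so the reported 10.93× presumably comes from unrounded measurements.

```python
# Latencies in milliseconds, copied from the table: (eager, CUDA Graph).
latencies_ms = {
    "Non-streaming": (60.0, 5.5),
    "Streaming": (219.3, 31.1),
}


def speedup(eager_ms, graph_ms):
    """Ratio of eager latency to CUDA Graph latency."""
    return eager_ms / graph_ms


def reduction_pct(eager_ms, graph_ms):
    """Share of the eager latency that is removed, in percent."""
    return 100.0 * (1.0 - graph_ms / eager_ms)


for mode, (eager, graph) in latencies_ms.items():
    print(
        f"{mode}: {speedup(eager, graph):.2f}x speedup, "
        f"{reduction_pct(eager, graph):.1f}% lower latency"
    )
```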

linyueqian

lgtm

qibaoyuan and others added 30 commits March 6, 2026 15:30

[mimo-audio] tok example

0b5ed57

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

[mimo-audio] example

e10ea18

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

[mimo-audio] example

80d4f24

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

[mimo-audio] revert

b10c4e0

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

[mimo-audio] cg refit

f2dd06b

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

[mimo-audio] streaming decode

0c41a3e

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

qibaoyuan and others added 3 commits May 8, 2026 08:47

[mimo-audio] torch.cuda.synchronize is banned: Use torch.accelerator.synchronize

5ee0fbc

Signed-off-by: 齐保元 <qibaoyuan@xiaomi.com>

qibaoyuan requested review from ZeldaHuang, linyueqian, princepride, yenuo26 and yuanheng-zhao as code owners May 9, 2026 01:06

linyueqian approved these changes May 11, 2026

hsliuustc0106 merged commit e108802 into vllm-project:main May 11, 2026
8 checks passed

hsliuustc0106 merged 111 commits into vllm-project:main from qibaoyuan:tok_cg
